The notebook deployment automatically includes Spark within each Python notebook kernel. This means that, upon kernel instantiation, a SparkContext object called sc
is immediately available in the Notebook, just as in a PySpark shell. Let's take a look at it:
In [1]:
?sc
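As a quick sanity check, we can also confirm that the predefined sc object is indeed a SparkContext instance:
In [ ]:
# Quick sanity check: sc should be an instance of pyspark.SparkContext
from pyspark import SparkContext
print isinstance(sc, SparkContext)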
We can inspect some of the SparkContext properties:
In [1]:
# Spark version we are using
print sc.version
In [3]:
# Name of the application we are running
print sc.appName
In [4]:
# Some configuration variables
print sc.defaultParallelism
print sc.defaultMinPartitions
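As a small illustration of what defaultParallelism means, an RDD created with sc.parallelize and no explicit partition count will normally be split into that many partitions:
In [ ]:
# Parallelize a small collection without specifying the number of slices;
# the resulting partition count should normally match sc.defaultParallelism above
rdd = sc.parallelize( range(100) )
print rdd.getNumPartitions()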
In [3]:
# Username running all Spark processes
# --> Note this is a method, not a property
print sc.sparkUser()
In [2]:
# Print out the SparkContext configuration
print sc._conf.toDebugString()
In [7]:
# Another way to get similar information
from pyspark import SparkConf, SparkContext
SparkConf().getAll()
Out[7]:
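Individual settings can also be looked up by key; for instance, the standard Spark configuration properties spark.master and spark.app.name:
In [ ]:
# Look up individual configuration properties by name
print sc._conf.get( "spark.master" )
print sc._conf.get( "spark.app.name" )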
We can also take a look at the Spark configuration this kernel is running under by using the above configuration data:
In [8]:
print sc._conf.toDebugString()
... this includes the execution mode for Spark. The default mode is local, i.e. all Spark processes run locally in the launched Virtual Machine. This is fine for developing and testing with small datasets.
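A quick way to check which mode the current kernel is using is the master URL held by the SparkContext (just an illustrative check; the exact URL depends on the configured mode):
In [ ]:
# The master URL reveals the execution mode: e.g. "local[*]" for local mode,
# "spark://<host>:<port>" for a standalone cluster, or a YARN master string for YARN
print sc.master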
But to run Spark applications on bigger datasets, they must be executed on a remote cluster. This deployment comes with configuration modes for that, which are selected through the spark-notebook service script by setting the cluster addresses and the desired execution mode, such as:
sudo service spark-notebook set-addr <master-ip> <namenode-ip> <historyserver-ip>
sudo service spark-notebook set-mode (local | standalone | yarn)
These operations can also be performed from outside the VM by telling Vagrant to relay them, e.g.:
vagrant ssh -c "sudo service spark-notebook set-mode local"
Finally, let's run a small job to check that the SparkContext works:
In [1]:
from operator import add
# Simple test job: sum the integers 0..9999 in parallel (the result should be 49995000)
l = sc.parallelize( xrange(10000) )
print l.reduce( add )
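Along the same lines, a couple of basic transformations can be chained on the same RDD; this is just a small additional sketch using standard RDD operations (map, filter, count):
In [ ]:
# Square each number, keep only the even squares and count them;
# a square is even exactly when the original number is even, so this should print 5000
squares = l.map( lambda x: x * x )
print squares.filter( lambda x: x % 2 == 0 ).count()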
In [ ]: